Agent

The final policy is wrapped in an agent that, like the analytical agent, follows a fixed set of rules.

The agent first looks at all valid moves and checks whether any of them is a winning move; if so, it plays it.
Failing that, the agent checks whether any of them would be a winning move for the opponent on the next turn; if so, it plays that move itself to block it.
Failing that, the agent plays the move recommended by the final policy if it is valid, and otherwise picks a random valid move.

import random
import numpy as np

final_policy = policies[10]

def rl_agent(obs, config):
    # A column can still accept a piece if its top cell is empty.
    valid_moves = [col for col in range(config.columns) if obs.board[col] == 0]
    # 1. Play a winning move if one exists.
    winning_moves = [move for move in valid_moves if check_winning_move(obs, config, move, obs.mark)]
    if winning_moves:
        return winning_moves[0]
    # 2. Otherwise block a move that would let the opponent (mark 3 - obs.mark) win next turn.
    losing_moves = [move for move in valid_moves if check_winning_move(obs, config, move, 3 - obs.mark)]
    if losing_moves:
        return losing_moves[0]
    # 3. Otherwise ask the policy for a column and play it if that column is not full.
    col, _ = final_policy.predict(np.array(obs.board).reshape(1, 6, 7))
    col = int(col)
    if obs.board[col] == 0:
        return col
    # The policy suggested a full column, so fall back to a random valid move.
    return random.choice(valid_moves)

The same rule-based wrapper is also used for every intermediate agent while the policies are being trained.
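
Because the same rules wrap every policy, one possible way to reuse them is to build the agent from a small factory instead of reading a global. The sketch below is an illustration under stated assumptions, not the notebook's code: `make_rule_agent` and `wins_after_drop` are hypothetical names, `wins_after_drop` is a stand-in for `check_winning_move`, and the board is assumed to be a flat, row-major list with the top row first, as in ConnectX.

```python
import random
from types import SimpleNamespace

def wins_after_drop(board, col, mark, rows, cols, inarow):
    """Hypothetical helper: True if dropping `mark` into `col` makes a line of `inarow`."""
    grid = [list(board[r * cols:(r + 1) * cols]) for r in range(rows)]
    # The piece lands in the lowest empty cell of the column.
    row = next((r for r in range(rows - 1, -1, -1) if grid[r][col] == 0), None)
    if row is None:
        return False  # column is full
    grid[row][col] = mark
    # Count matching pieces through the new piece along each of the four line directions.
    for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
        count = 1
        for sign in (1, -1):
            r, c = row + sign * dr, col + sign * dc
            while 0 <= r < rows and 0 <= c < cols and grid[r][c] == mark:
                count += 1
                r, c = r + sign * dr, c + sign * dc
        if count >= inarow:
            return True
    return False

def make_rule_agent(predict_column):
    """Wrap any callable that maps a board to a column with the win / block / fallback rules."""
    def agent(obs, config):
        valid = [c for c in range(config.columns) if obs.board[c] == 0]
        # First look for our own winning move, then for the opponent's.
        for mark in (obs.mark, 3 - obs.mark):
            for c in valid:
                if wins_after_drop(obs.board, c, mark, config.rows, config.columns, config.inarow):
                    return c
        # No immediate win or threat: defer to the policy, guarding against full columns.
        col = int(predict_column(obs.board))
        return col if col in valid else random.choice(valid)
    return agent

# Example: three of player 1's pieces on the bottom row, columns 0 to 2.
config = SimpleNamespace(rows=6, columns=7, inarow=4)
board = [0] * 42
board[35] = board[36] = board[37] = 1
agent = make_rule_agent(lambda b: 6)  # stub policy that always suggests column 6
move_as_player_1 = agent(SimpleNamespace(board=board, mark=1), config)  # completes the row
move_as_player_2 = agent(SimpleNamespace(board=board, mark=2), config)  # blocks the row
```

With a trained policy, the callable passed in could be something like `lambda b: final_policy.predict(np.array(b).reshape(1, 6, 7))[0]`, so each intermediate policy gets its own agent without rebinding `final_policy`.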